18 research outputs found

    Multiplication tenseur–vecteur haute performance sur des machines à mémoire partagée

    Get PDF
    Tensor–vector multiplication is one of the core components in tensor computations. We have recently investigated high-performance, single-core implementations of this bandwidth-bound operation. In this work, we investigate efficient shared-memory algorithms to carry out this operation. Upon carefully analyzing the design space, we implement a number of alternatives using OpenMP and compare them experimentally. Experimental results on systems with up to 8 sockets show near-peak performance for the proposed algorithms.
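    A minimal sketch of the operation itself may help: assuming a dense order-3 tensor stored contiguously with the last index varying fastest, contraction along the middle mode computes y[i1][i3] as the sum over i2 of A[i1][i2][i3] * x[i2]. One straightforward OpenMP variant (illustrative only, not the authors' tuned kernels) parallelises over the first mode so that each thread writes a disjoint slice of the output:

    #include <stddef.h>

    /* Illustrative mode-2 tensor-vector multiply for a dense n1 x n2 x n3
     * tensor A in row-major layout (last index fastest):
     *     y[i1][i3] = sum_{i2} A[i1][i2][i3] * x[i2].
     * Each thread owns a block of i1 slices, so writes to y never conflict,
     * and the innermost loop is unit-stride in both A and y. */
    void tvm_mode2(const double *A, const double *x, double *y,
                   size_t n1, size_t n2, size_t n3)
    {
        #pragma omp parallel for schedule(static)
        for (size_t i1 = 0; i1 < n1; ++i1) {
            double *yslice = y + i1 * n3;
            for (size_t i3 = 0; i3 < n3; ++i3)
                yslice[i3] = 0.0;
            for (size_t i2 = 0; i2 < n2; ++i2) {
                const double *aslice = A + (i1 * n2 + i2) * n3;
                const double xi2 = x[i2];
                for (size_t i3 = 0; i3 < n3; ++i3)
                    yslice[i3] += xi2 * aslice[i3];
            }
        }
    }

    Compiled with OpenMP enabled (for example gcc -O3 -fopenmp), this already exposes the bandwidth-bound nature of the operation; the design space analysed in the paper covers choices this sketch ignores, such as loop ordering and the placement of work and data across sockets.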

    A native tensor-vector multiplication algorithm for high performance computing

    Get PDF
    Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor-vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This paper proposes an open-source TVM algorithm which is much simpler and more efficient than previous approaches, making it suitable for integration in the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization, multi-threading, and NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors of up to order 10 with various compilers and hardware architectures equipped with traditional DDR and high-bandwidth memory (HBM). For large tensors the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth for NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM the TVM exhibits some mode dependency but manages to reach performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers on average between 58% and 69% of the theoretical bandwidth for large tensors. This work was supported in part by MCIN/AEI and ESF under Grant RYC2019-027592-I, and in part by the HPC Technology Innovation Lab, a Barcelona Supercomputing Center and Huawei research cooperation agreement (2020).
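    The unit-stride access pattern for an arbitrary contraction mode can be sketched by viewing an order-d tensor with dimensions n[0..d-1] (row-major, last index fastest) as a 3D array of shape left x n[k] x right, where left is the product of the dimensions before mode k and right the product of those after it. The generic loop nest below is only an illustration of that viewpoint, with hypothetical names; the paper's kernel layers cache blocking, vectorization, and NUMA-aware placement on top of such a loop structure:

    #include <stddef.h>

    /* Illustrative mode-k TVM: y[l][r] = sum_j A[l][j][r] * x[j], with the
     * tensor interpreted as left x n[k] x right. For k < d-1 the inner loop
     * is unit-stride in both A and y; for k = d-1 it degenerates to a dot
     * product per row, and for k = 0 the outer loop offers no parallelism,
     * which is where mode dependency tends to appear in naive code. */
    void tvm_mode_k(const double *A, const double *x, double *y,
                    const size_t *n, size_t d, size_t k)
    {
        size_t left = 1, right = 1;
        for (size_t i = 0; i < k; ++i)     left  *= n[i];
        for (size_t i = k + 1; i < d; ++i) right *= n[i];
        const size_t nk = n[k];

        #pragma omp parallel for schedule(static)
        for (size_t l = 0; l < left; ++l) {
            double *yrow = y + l * right;
            for (size_t r = 0; r < right; ++r)
                yrow[r] = 0.0;
            for (size_t j = 0; j < nk; ++j) {
                const double *arow = A + (l * nk + j) * right;
                const double xj = x[j];
                for (size_t r = 0; r < right; ++r)
                    yrow[r] += xj * arow[r];
            }
        }
    }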

    Open Problems in (Hyper)Graph Decomposition

    Full text link
    Large networks are useful in a wide range of applications. Sometimes problem instances are composed of billions of entities. Decomposing and analyzing these structures helps us gain new insights about our surroundings. Even if the final application concerns a different problem (such as traversal, finding paths, trees, and flows), decomposing large graphs is often an important subproblem for complexity reduction or parallelization. This report is a summary of discussions held at Dagstuhl seminar 23331 on "Recent Trends in Graph Decomposition" and presents currently open problems and future directions in the area of (hyper)graph decomposition.

    High-level strategies for parallel shared-memory sparse matrix-vector multiplication

    No full text
    The sparse matrix–vector multiplication is an important kernel, but it is hard to execute efficiently even in the sequential case. The problems, namely low arithmetic intensity, inefficient cache use, and limited memory bandwidth, are magnified as the core count on shared-memory parallel architectures increases. Existing techniques are discussed in detail and categorised chiefly by their distribution types. Based on this, new parallelisation techniques are proposed. The theoretical scalability and memory usage of the various strategies are analysed, and experiments on multiple NUMA architectures confirm the validity of the results. One of the newly proposed methods attains the best average result, obtaining a parallel efficiency of 90 percent in one of the experiments.
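    As a point of reference for the distribution types discussed, the most common baseline is a one-dimensional row distribution over a matrix in compressed row storage (CRS), where each thread owns a block of rows and therefore a disjoint part of the output vector. The sketch below is that generic baseline under those assumptions, not one of the newly proposed strategies:

    #include <stddef.h>

    /* Baseline parallel SpMV, y = A*x, with A in CRS: row_ptr has m+1
     * entries, col_idx and val hold the nonzeroes row by row. Writes to y
     * are conflict-free because rows are partitioned over threads; the
     * irregular reads of x[col_idx[p]] are what limit cache efficiency on
     * unstructured matrices. */
    void spmv_crs(size_t m, const size_t *row_ptr, const size_t *col_idx,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < m; ++i) {
            double sum = 0.0;
            for (size_t p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
                sum += val[p] * x[col_idx[p]];
            y[i] = sum;
        }
    }

    On NUMA machines the placement of the CRS arrays and the vectors across memory domains matters as much as the loop itself, which is one of the aspects the higher-level strategies address.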

    A Cache-Oblivious Sparse Matrix–Vector Multiplication Scheme Based on the Hilbert Curve

    No full text
    The sparse matrix–vector (SpMV) multiplication is an important kernel in many applications. When the sparse matrix used is unstructured, however, standard SpMV multiplication implementations are typically inefficient in terms of cache usage, sometimes working at only a fraction of peak performance. Cache-aware algorithms take information on specifics of the cache architecture as a parameter to derive an efficient SpMV multiply. In contrast, cache-oblivious algorithms strive to obtain efficiency regardless of cache specifics. In earlier work in this latter area, Haase et al. (2007) use the Hilbert curve to order nonzeroes in the sparse matrix. They obtain speedup mainly when multiplying against multiple (up to eight) right-hand sides simultaneously. We improve on this by introducing a new data structure, called Bi-directional Incremental Compressed Row Storage (BICRS). Using this data structure to store the nonzeroes in Hilbert order, speedups of up to a factor of two are attained for the SpMV multiplication y = Ax on sufficiently large, unstructured matrices.
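    The data structure proposed here encodes row and column increments; the simplified sketch below instead keeps explicit (row, column, value) triplets and only illustrates the traversal order, using the standard bitwise Hilbert-index construction. All names are illustrative; the point is that consecutive nonzeroes touch nearby entries of both the input and the output vector:

    #include <stdlib.h>

    typedef struct { size_t key, row, col; double val; } nz_t;

    /* Position of (x, y) along a Hilbert curve over an n-by-n grid,
     * n a power of two (standard xy2d construction). */
    static size_t hilbert_key(size_t n, size_t x, size_t y)
    {
        size_t d = 0;
        for (size_t s = n / 2; s > 0; s /= 2) {
            size_t rx = (x & s) > 0, ry = (y & s) > 0;
            d += s * s * ((3 * rx) ^ ry);
            if (ry == 0) {                      /* rotate/flip the quadrant */
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                size_t t = x; x = y; y = t;
            }
        }
        return d;
    }

    static int by_key(const void *a, const void *b)
    {
        size_t ka = ((const nz_t *)a)->key, kb = ((const nz_t *)b)->key;
        return (ka > kb) - (ka < kb);
    }

    /* y += A*x with the nonzeroes visited in Hilbert order. */
    void spmv_hilbert(nz_t *nz, size_t nnz, size_t n,
                      const double *x, double *y)
    {
        for (size_t k = 0; k < nnz; ++k)
            nz[k].key = hilbert_key(n, nz[k].row, nz[k].col);
        qsort(nz, nnz, sizeof *nz, by_key);
        for (size_t k = 0; k < nnz; ++k)
            y[nz[k].row] += nz[k].val * x[nz[k].col];
    }

    BICRS realises the same traversal order while storing increments rather than explicit coordinates, which reduces the index overhead of visiting the nonzeroes in Hilbert order.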

    Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods

    No full text
    In this article, we introduce a cache-oblivious method for sparse matrix–vector multiplication. Our method attempts to permute the rows and columns of the input matrix using a recursive hypergraph-based sparse matrix partitioning scheme so that the resulting matrix induces cache-friendly behavior during sparse matrix–vector multiplication. Matrices are assumed to be stored in row-major format, by means of the compressed row storage (CRS) or its variants, incremental CRS and zig-zag CRS. The zig-zag CRS data structure is shown to fit well with the hypergraph metric used in partitioning sparse matrices for the purpose of parallel computation. The separated block-diagonal (SBD) form is shown to be the appropriate matrix structure for cache enhancement. We have implemented a run-time cache simulation library enabling us to analyze cache behavior for arbitrary matrices and arbitrary cache properties during matrix–vector multiplication within a k-way set-associative idealized cache model. The results of these simulations are then verified by actual experiments run on various cache architectures. In all these experiments, we use the Mondriaan sparse matrix partitioner in one-dimensional mode. The savings in computation time achieved by our matrix reorderings reach up to 50 percent in the case of a large link matrix.
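    Of the three row-major storage variants, zig-zag CRS is the easiest to sketch: the nonzeroes of every odd-numbered row are stored in decreasing column order, so that when the multiply kernel steps from the end of one row to the start of the next, the column indices, and hence the touched entries of the input vector, stay close together. The hypothetical conversion routine below illustrates the layout under that assumption; the multiplication loop itself remains the unchanged CRS loop:

    #include <stddef.h>

    /* Turn plain CRS (row_ptr, col_idx, val) into zig-zag CRS by reversing
     * the nonzero order of every odd row in place. Only the storage order
     * changes; the computed values of y = A*x are unaffected. */
    void crs_to_zigzag(size_t m, const size_t *row_ptr,
                       size_t *col_idx, double *val)
    {
        for (size_t i = 1; i < m; i += 2) {
            size_t lo = row_ptr[i], hi = row_ptr[i + 1];
            while (lo + 1 < hi) {
                --hi;
                size_t ct = col_idx[lo]; col_idx[lo] = col_idx[hi]; col_idx[hi] = ct;
                double vt = val[lo];     val[lo]     = val[hi];     val[hi]     = vt;
                ++lo;
            }
        }
    }

    Combined with the separated block-diagonal reordering, this helps keep the working set of the input vector small while the kernel sweeps through consecutive rows.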

    SIAM’s CSC Workshop Series Marks 10th Year

    No full text
    A news article on the sixth SIAM Workshop on Combinatorial Scientific Computing. The 2014 SIAM Workshop on Combinatorial Scientific Computing, held at the École Normale Supérieure in Lyon, France, July 21–23, was the sixth in a series that began ten years ago in San Francisco. True to CSC tradition, the 2014 workshop program comprised a wide range of combinatorial topics arising from many corners of scientific computing. Presenting recent results was a diverse set of speakers: PhD students, postdocs, early-career researchers, and well-established researchers from academia, national laboratories, and industry; speakers from industry accounted for more than 10% of the talks. We give a summary of the meeting.

    High performance tensor-vector multiplication on shared-memory systems

    No full text